• How do you add new features in a scikit-learn pipeline with a ColumnTransformer?

    I came across this pipeline setup where feature engineering is being added before a ColumnTransformer, but the new features don’t seem to flow correctly through the pipeline:

    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.base import BaseEstimator, TransformerMixin
    
    class FeatureAdder(BaseEstimator, TransformerMixin):
        def fit(self, X, y=None):
            return self
        
        def transform(self, X):
            X = X.copy()  # avoid mutating the caller's DataFrame
            X['new_feature'] = X['col1'] * X['col2']
            return X
    
    pipeline = Pipeline([
        ('feature_add', FeatureAdder()),
        ('preprocess', ColumnTransformer([
            ('num', StandardScaler(), ['col1', 'col2']),
            ('cat', OneHotEncoder(), ['col3'])
        ]))
    ])
    

    The issue is:

    • The newly created new_feature is not included in the ColumnTransformer

    • This leads to it being dropped during transformation, since the ColumnTransformer keeps only the columns it is told about (its default is remainder='drop')

    In a setup like this:

    • Should the ColumnTransformer be dynamically updated to include new features (one option is sketched after these questions)?

    • Or is it better to handle feature engineering outside the pipeline altogether?

    • How do you ensure feature consistency without breaking pipeline modularity?
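
    For concreteness, a minimal sketch of the first option, reusing the imports and FeatureAdder defined above: because FeatureAdder runs before the ColumnTransformer in the pipeline, the engineered column already exists when columns are selected, so it can simply be listed alongside the original numeric columns.

    # Sketch: the engineered column is listed explicitly in the numeric branch.
    pipeline = Pipeline([
        ('feature_add', FeatureAdder()),
        ('preprocess', ColumnTransformer([
            ('num', StandardScaler(), ['col1', 'col2', 'new_feature']),
            ('cat', OneHotEncoder(), ['col3'])
        ]))
    ])

    If listing columns by hand feels brittle, an alternative is remainder='passthrough' on the ColumnTransformer so that any unlisted columns (including newly created ones) are kept untransformed rather than dropped.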

  • How to handle imbalanced datasets effectively in classification problems?

    I’m working on a classification problem where one class heavily outweighs the others (around 90:10 ratio). My model is achieving high accuracy, but it’s clearly biased toward the majority class.

    Here’s a simplified version:

     
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    model = RandomForestClassifier()
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))

     

    Accuracy looks good, but recall and precision for the minority class are poor.

    What I want to understand:

    • What are the best techniques to handle imbalance (SMOTE, class weights, etc.)?
    • When should I prefer resampling vs adjusting model parameters?
    • Which evaluation metrics should I focus on in such cases?

    Would appreciate practical advice based on real-world experience.
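
    For concreteness, a minimal sketch of the class-weight option (assuming the same X and y as above; stratify and class_weight are the only changes):

    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report

    # Stratified split keeps the 90:10 class ratio in both train and test sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    # class_weight='balanced' up-weights minority-class errors during training.
    model = RandomForestClassifier(class_weight='balanced', random_state=42)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    # Per-class precision, recall, and F1 matter more than overall accuracy here.
    print(classification_report(y_test, y_pred))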

     
     
  • How do you handle model performance degradation after deployment?

    Many models perform well during training and validation but start degrading in production due to data drift, concept drift, or changing user behavior. What monitoring strategies, retraining pipelines, or evaluation practices do you use to maintain model performance in production environments?

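    To make the monitoring part concrete, here is a minimal sketch of one common drift check: a population stability index (PSI) comparison between a training-time snapshot of a feature and a recent production sample. The column names and the 0.2 threshold mentioned below are placeholders, not a standard.

    import numpy as np

    def population_stability_index(expected, actual, bins=10):
        """Rough PSI between a reference (training) sample and a recent production sample."""
        # Bin edges come from the reference distribution.
        edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
        edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the training range
        expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
        actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
        # Clip to avoid log(0) for empty bins.
        expected_pct = np.clip(expected_pct, 1e-6, None)
        actual_pct = np.clip(actual_pct, 1e-6, None)
        return np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))

    # Hypothetical usage: values above roughly 0.2 are often treated as meaningful drift.
    # psi = population_stability_index(train_df["feature"], recent_df["feature"])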

  • Why does model performance drop when using time-based train-test splits?

    I’m working on a data science project with time-ordered data, and I’m seeing a significant drop in model performance once I move from training to validation. I’m sharing a simplified version of the problem and code below.

    The dataset represents events over time, and the target is binary. I initially used a random train-test split, but later switched to a time-based split to better reflect real-world usage. After this change, performance dropped sharply, and I’m trying to understand whether this is expected or if I’m doing something wrong.

    Here’s a simplified version of the code:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    
    # sample data
    df = pd.read_csv("data.csv")
    df = df.sort_values("event_time")
    
    X = df.drop(columns=["target"])
    y = df["target"]
    
    # time-based split
    split_index = int(len(df) * 0.8)
    X_train, X_test = X.iloc[:split_index], X.iloc[split_index:]
    y_train, y_test = y.iloc[:split_index], y.iloc[split_index:]
    
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)
    
    preds = model.predict_proba(X_test)[:, 1]
    print("AUC:", roc_auc_score(y_test, preds))
    

    With a random split, the AUC was around 0.82.
    With the time-based split, it drops to around 0.61.

    I’m trying to understand:

    • Is this performance gap a common sign of data leakage in the original setup?

    • Are tree-based models like Random Forests particularly sensitive to temporal shifts?

    • What are good practices to diagnose whether this is concept drift, feature leakage, or simply a harder prediction problem?

    • Would you approach validation differently for time-dependent data like this?

    Looking for general guidance, validation strategies, or patterns others have seen in similar scenarios.
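
    For concreteness, a minimal sketch of a rolling time-based validation with TimeSeriesSplit (same time-sorted X and y as above; the fold count is arbitrary):

    from sklearn.model_selection import TimeSeriesSplit
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score

    # Each fold trains on an earlier window and evaluates on the window that follows it.
    tscv = TimeSeriesSplit(n_splits=5)
    for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
        model = RandomForestClassifier(random_state=42)
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        preds = model.predict_proba(X.iloc[test_idx])[:, 1]
        # Assumes both classes appear in every fold; otherwise roc_auc_score raises an error.
        print(f"fold {fold}: AUC = {roc_auc_score(y.iloc[test_idx], preds):.3f}")

    Fold-to-fold variation in AUC can help separate a genuinely harder recent period (scores degrading over time) from a one-off issue with a single holdout window.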

     

     

  • Future of Data Science Moving Away From Modeling and Toward Problem Framing?

    Data science as a discipline is shifting faster than most people realize. A decade ago, the core skill set revolved around building models, tuning hyperparameters, crafting feature pipelines, and selecting algorithms. But with the rise of AutoML, pretrained foundation models, vector databases, and agentic AI systems, much of the “technical heavy lifting” is becoming automated or abstracted away.

    Today, the competitive advantage is less about who can write the best model from scratch and more about who can frame the right problem, define meaningful metrics, interpret model outputs responsibly, design data loops, and understand the business impact of predictions. Even the most complex models (LLMs, multimodal architectures, time-series forecasters) can now be deployed with pre-built frameworks or API calls.

    This shift raises an important question about the future of the field:
    If modeling becomes commoditized, does the true value of a data scientist lie in strategic thinking rather than technical implementation?
